The Efficiency of MapReduce in Parallel External Memory
Authors
Abstract
Since its introduction in 2004, the MapReduce framework has become one of the standard approaches to massive distributed and parallel computation. In contrast to its intensive use in practice, its theoretical footing is still limited, and little work has been done so far to put MapReduce on a par with the major computational models. Following pioneering work that relates the MapReduce framework to PRAM and BSP in their macroscopic structure, we focus on the functionality provided by the framework itself, considered in the parallel external memory (PEM) model. In this setting, we present upper and lower bounds on the parallel I/O complexity of the shuffle step that match up to constant factors. The shuffle step is the single communication phase in which all information of one MapReduce invocation is transferred from map workers to reduce workers. Hence, in contrast to previous work, we shift the focus to the internal communication step. The results we obtain also carry over to the BSP* model. On the one hand, this shows how much complexity can be “hidden” in an algorithm expressed in MapReduce compared to PEM. On the other hand, our results bound the worst-case performance loss of the MapReduce approach in terms of I/O efficiency.
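As an illustration of the shuffle step described in the abstract, the following is a minimal in-memory sketch in Python. It only shows the data movement from map output to reduce input; it does not model the PEM machine, block transfers, or any I/O cost from the paper. All function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the hash-based partitioning are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every record; each call returns (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs, num_reducers):
    """Route each pair to one reduce worker and group its values by key.

    Hash partitioning is an assumption of this sketch; it guarantees that
    all values of a given key end up at the same (hypothetical) reducer.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets


def reduce_phase(buckets, reduce_fn):
    """Run reduce_fn once per key on every reducer and merge the outputs."""
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            result[key] = reduce_fn(key, values)
    return result


def word_count(lines, num_reducers=3):
    """Toy MapReduce invocation: count word occurrences across lines."""
    pairs = map_phase(lines, lambda line: [(word, 1) for word in line.split()])
    buckets = shuffle(pairs, num_reducers)
    return reduce_phase(buckets, lambda key, values: sum(values))


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

In this sketch, `shuffle` is the only place where data crosses from map workers to reduce workers, which is the phase whose parallel I/O complexity the abstract addresses.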
Similar resources
Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming
The objective of this study is to verify the importance of cloud computing services' capabilities in managing and analyzing big data in business organizations, because the rapid development in the use of information technology in general, and network technology in particular, has led many organizations to make their applications available for use via electronic platforms hos...
Efficient Parallel and External Matching
We study a simple parallel algorithm for computing matchings in a graph. A variant for unweighted graphs finds a maximal matching using linear expected work and O(log^2 n) expected running time in the CREW PRAM model. Similar results also apply to External Memory, MapReduce and distributed memory models. In the maximum weight case the algorithm guarantees a 1/2-approximation. Although the parallel ...
MapReduce for the Cell B.E. Architecture
MapReduce is a simple and flexible parallel programming model proposed by Google for large scale data processing in a distributed computing environment [4]. In this paper, we present a design and implementation of MapReduce for the Cell architecture. This model provides a simple machine abstraction to users, hiding parallelization and hardware primitives. Our runtime automatically manages paral...
On the Complexity of List Ranking in the Parallel External Memory Model
We study the problem of list ranking in the parallel external memory (PEM) model. We observe an interesting dual nature for the hardness of the problem due to limited information exchange among the processors about the structure of the list, on the one hand, and its close relationship to the problem of permuting data, which is known to be hard for the external memory models, on the other hand. ...
On Optimal Algorithms for List Ranking in the Parallel External Memory Model with Applications to Treewidth and other Elementary Graph Problems
The performance of many algorithms on large input instances substantially depends on the number of triggered cache misses instead of the number of executed operations. This behavior is captured by the external memory model in a natural way. It models a computer by a fast cache of bounded size and a conceptually infinite (external) memory. In contrast to the classical RAM model, the complexity me...